Re-Store: A System for Compressing, Browsing, and Searching Large Documents
Abstract
Mechanisms for compressing text have been studied for many years, and a wide range of effective methods has been developed. But compression of text makes it unwieldy in other ways – it must be decompressed before being viewed, and it is harder to search directly. One way of addressing these concerns is to build an index for the source document, and search via a query interface [Witten et al., 1999]. Then the passages sought by the user can be identified using Boolean or ranked queries, and only those selected passages need be fetched and decompressed. An alternative is to use a compression mechanism that is amenable to compressed pattern matching, and undertake the equivalent of an exhaustive linear search in the compressed text to locate passages of interest [de Moura et al., 2000]. Again, only small fragments of the source document might eventually be presented to the user. In this presentation we consider a third approach, and describe a software compression and searching system – dubbed RE-STORE – that supports browsing within the compressed text based on phrases extracted from the text, and fast identification and decompression of the passages in the text containing those phrases. In our system, the user selects one or more terms of interest from a static list that is somewhat akin to a vocabulary derived from the document, and is then free
Related resources
Re-Store: A System for Compressing, Browsing, and Searching Large Documents (Invited Paper)
Mechanisms for compressing text have been studied for many years, and a wide range of effective methods has been developed. But compression of text makes it unwieldy in other ways – it must be decompressed before being viewed, and it is harder to directly search. One way of addressing these concerns is to build an index for the source document, and search via a query interface [Witten et al., 1...
Managing Personal Documents with a Digital Library
This paper presents a desktop system for managing personal documents. The documents can be of many types (text, spreadsheets, images, multimedia) and are organized in a personal “digital library”. The interface supports browsing over a wide variety of document metadata, as well as full-text searching. This extensive browsing facility addresses a significant flaw in digital library and file manag...
A MPEG-4/7 based Internet Video and Still Image Browsing System
The ongoing MPEG-7 standard intends to provide a “Multimedia Content Description Interface.” In other words, it will provide a rich set of tools to describe content with a view to facilitating applications such as content based querying, browsing and searching of multimedia content. The MPEG-4 standard provides tools for compressing multimedia content at bitrates that are feasible with typical ...
Information Retrieval over Multimedia Documents
While there are many textual and image retrieval systems, few have explored the granularity of the retrieval unit and the use of all available information for retrieval. This paper presents our work on using textual and image retrieval, fusing the results and providing document retrieval that uses visual and textual information from documents. A query refinement technique is also shown that blur...
A collaborative faceted categorization system - user interactions
We are building a system that improves browsing and searching access to a large, growing collection by supporting users to build a faceted (multiperspective) classification schema collaboratively. The system is targeted in particular to collections of photographs and images that, in general, have few textual metadata. Our system allows users to build and maintain a faceted classification schema...